USE 337 - application level batching for embeddings #37
Merged
Conversation
ghukill commented Jan 20, 2026
Comment on lines +68 to +71
```python
for embedding_inputs_batch in batched(embedding_inputs, batch_size):
    logger.debug(f"Processing batch of {len(embedding_inputs_batch)} inputs")
    for embedding_input in embedding_inputs_batch:
        yield self.create_embedding(embedding_input)
```
Collaborator
Author
This update to the testing model fixture shows very simply and directly how batching is applied.
Why these changes are being introduced:

There are two levels at which "batching" may come into play:

1. Many ML models have internal batching. You can provide 100 in an array, but it might only create embeddings for 10 at a time. However, sometimes you pay the full memory pressure weight of the original 100.
2. Our `create_embeddings()` wrapper method can send batches to the ML embedding model method. Because the input `embedding_inputs` is an iterator, this ensures that we keep memory pressure low even for large numbers of records to embed.

We need to keep memory pressure low as we move into supporting local, Fargate, and GPU contexts for embedding creation.

How this addresses that need:

We have introduced batching at the `create_embeddings()` method layer, leveraging the records iterator that is used as input. In doing so, we ensure that the ML model only sees small(ish) batches to create embeddings for. To reiterate, this is distinct from the ML model itself, which may have some internal batching. For example, we may send a batch of 100 to the model, but it might still create embeddings via internal batching of 2-5 records.

Side effects of this change:

* Confirmed low memory usage locally, on Fargate ECS, and on GPU-backed EC2 for runs up to 10k records.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/USE-337
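For context, here is a minimal, self-contained sketch of the two batching levels described above. The class name `SketchEmbeddingModel`, string inputs, and the fake per-record `create_embedding()` body are hypothetical; only the inner generator loop mirrors the fixture diff shown earlier.

```python
from collections.abc import Iterator
from itertools import batched  # Python 3.12+
import logging

logger = logging.getLogger(__name__)


class SketchEmbeddingModel:
    """Hypothetical stand-in for the testing model fixture; not the real class."""

    def create_embedding(self, embedding_input: str) -> list[float]:
        # Placeholder for real inference; an actual model may also apply its own
        # internal batching (e.g. 2-5 records at a time) inside a call like this.
        return [float(len(embedding_input))]

    def create_embeddings(
        self, embedding_inputs: Iterator[str], batch_size: int = 32
    ) -> Iterator[list[float]]:
        # Application-level batching: because `embedding_inputs` is an iterator,
        # only `batch_size` inputs are ever materialized at once.
        for embedding_inputs_batch in batched(embedding_inputs, batch_size):
            logger.debug(f"Processing batch of {len(embedding_inputs_batch)} inputs")
            for embedding_input in embedding_inputs_batch:
                yield self.create_embedding(embedding_input)
```

The key point is that `create_embeddings()` stays a generator: callers pull results batch by batch rather than forcing the full input list into memory.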
b57c988 to 6bdfea4
ehanson8 approved these changes Jan 20, 2026
Works as advertised, code looks good to me, approved!
Purpose and background context
The timdex-embeddings app needs to support local, Fargate ECS (cpu), and EC2 (gpu) compute environments. Each has its own knobs and dials to turn when it comes to performance and resources. One area where we will encounter issues, if we're not careful, is memory consumption.
The model opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte, currently our only supported model, has a "batch size" configuration. This configures how many records are processed at once to create embeddings, which can affect performance, but does not have much effect on memory management. This is partially because a fully materialized list of records must be passed to the model, which effectively pulls them all into memory.

Take the example of a job that wants to create 10k embeddings. We have an iterator of those embedding inputs that TDA will pull in a memory safe fashion, but we cannot materialize them all into memory. Even if we could, passing them all to the model for embedding would likely blow up memory, even though the "batch size" is 2-5 records.
What we need is a higher level batching layer, at our application layer, that manages the iterator of input records, passes memory safe batches to the model, writes the results, then is fully done with those records.
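A rough sketch of that flow is below; `model`, `embedding_inputs`, `write_embeddings`, and `write_batch_size` are hypothetical placeholders for illustration, not the app's actual API.

```python
from itertools import batched  # Python 3.12+


def run_embedding_job(model, embedding_inputs, write_embeddings, write_batch_size=100):
    """Hypothetical driver loop: create embeddings lazily and persist them in chunks."""
    # `create_embeddings()` is assumed to be a generator over an input iterator,
    # so nothing is materialized up front.
    embeddings = model.create_embeddings(embedding_inputs)
    for embeddings_chunk in batched(embeddings, write_batch_size):
        # Write this chunk, then let it go out of scope before pulling the next one,
        # so memory stays bounded even for jobs with 10k+ records.
        write_embeddings(embeddings_chunk)
```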
This PR introduces batching at the application layer, ensuring that even large jobs are completed successfully.
The method `create_embeddings()` requires an iterator of `EmbeddingInput`s to embed. By sending batches of these to the ML model, then writing out the results, we are completely done with them and can move on to the next batch, thereby managing memory.

How can a reviewer manually see the effects of these changes?
1. Run `make install` for updated dependencies.
2. Set Dev1 AWS credentials in terminal.
3. If not done already, download the model:
4. Create embeddings with a small batch size:
Some logging analysis: opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte does not benefit from large batches. Even that internal model batching of 4 may get bumped down to 1-2 at some point. But the application batching of 3 ensures that memory consumption stays low.

Local inference is slow enough that it's tricky to test larger runs and confirm that memory consumption is low, but we can confirm that GPU runs are successful at 1k, 5k, even 10k records, which formerly exceeded memory.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: memory management is safe locally, on Fargate ECS (cpu), and on EC2 (gpu)
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-337

Code review